In [5]:
import numpy as np
import pandas as pd
from datetime import datetime
from glob import glob
from StringIO import StringIO

''' created quick class based on August 1st Analysis '''
from MTAReader import MTAReader

Process All Data Over Range

created quick class MTAReader based on August 1st Analysis. Should make notebook easier to follow.

Ideally we could just read in the first and then last file, but I found errors in the first MTA file for august 1st where the odometer counters were reset, so we do indeed need to read in the full files... Or we could potential lose up to N days of readings without ever knowing

Note, know I have to fix the NaN rows, which represent the ~4 hour gap between the last row in the first file and then first row in next file. These files account for a total of 70.4 MB of data, so I decided it is quickest and most plausible to ignore this issue completely at this stage - instead combining the files into one big string. I'm running this on my old Macbook pro and it's fairly quick. Problem is this does not scale if we go to N files, but 80 - 20 rule and the analysis will be the same. For the interactive section SQL should make this much easier, I just think it's a bit too verbose to use in this setting at the moment. Will be sure to compare results here to my webapp, once implemented

Note: Found error on line 26668 in turnstile_130706.txt so I deleted the null characters manually, and this worked. Could use string replacement in below, but I want the data corrected later too.

In [6]:
combined_files = ''

for i_file, file_name in enumerate(glob('data/turnstile_*')):
#     if i_file == 1:
#         print('Skipping', file_name)
#         continue

#     if i_file > 0:
#         combined_files += '\n'
        
    print('Reading', file_name)
    combined_files += open(file_name).read()
    #     combined_files = open(file_name).read()

print('Len of all files together', len(combined_files))

combined_files_buffer = StringIO(combined_files)
combined_files_buffer
('Reading', 'data/turnstile_130706.txt')
('Reading', 'data/turnstile_130713.txt')
('Reading', 'data/turnstile_130720.txt')
('Reading', 'data/turnstile_130727.txt')
('Reading', 'data/turnstile_130803.txt')
('Len of all files together', 70355724)
Out[6]:
<StringIO.StringIO instance at 0x1103fd0e0>
In [7]:
mta = MTAReader('Misc', data_str=combined_files_buffer)
('Starting', 'Misc')
Generated columns
Read in raw data
Merged with stations
Types corrected
Reduced
Computted Traffic
('Fixing # rows:', 229)
Replaced bad odemeter values
('Dumped to', '2013-06-29 - 2013-08-02.csv')

Check results to August 1st

Note, they could be a little off, due to the first of the day now having previous data to compute with

In [8]:
dateparse = lambda x: pd.datetime.strptime(x, '%Y-%m-%d %H:%M:%S')

all_for_period = pd.read_csv(
    'data_sanitized/2013-07-27 - 2013-08-02.csv',
    parse_dates=['datetime', 'date'],
    date_parser=dateparse
)

august_1_2013 = all_for_period[all_for_period.date == datetime(2013, 8, 1)]
# august_1_2013 = all_for_period[all_for_period.date == '2013-08-01 00:00:00']

august_1_2013.to_csv('august_1_class.csv', index=False)

august_1_total_traffic = pd.Series({
    'exits': august_1_2013.exits.sum(),
    'entries': august_1_2013.entries.sum()
})
august_1_total_traffic['total'] = august_1_total_traffic.sum()
august_1_total_traffic
Out[8]:
entries     5922368
exits       4588930
total      10511298
dtype: float64
In [42]:
august_1_2013_from_big = mta.df[mta.df.date == datetime(2013, 8, 1)]

august_1_total_traffic_big = pd.Series({
    'exits': august_1_2013_from_big.exits.sum(),
    'entries': august_1_2013_from_big.entries.sum()
})
august_1_total_traffic_big['total'] = august_1_total_traffic_big.sum()
august_1_total_traffic_big
Out[42]:
entries     5900665
exits       4560448
total      10461113
dtype: float64
In [43]:
difference_between_results = august_1_total_traffic - august_1_total_traffic_big
difference_between_results
Out[43]:
entries    21703
exits      28482
total      50185
dtype: float64
In [44]:
normalized_difference = difference_between_results / august_1_total_traffic
normalized_difference
Out[44]:
entries    0.003665
exits      0.006207
total      0.004774
dtype: float64
In [45]:
df_by_station = august_1_2013.groupby(['station', 'scp'])

scp_by_usage = []
for (i_station, ((station, scp), data)) in enumerate(df_by_station):
#     if i_station > 2:
#         break
    
    scp_by_usage.append({
        'scp': scp,
        'station': station,
        'cummulative_flow': data.cummulative_flow.sum()
    })

scp_by_usage = pd.DataFrame(scp_by_usage)
scp_by_usage = scp_by_usage.sort(
    ['cummulative_flow', 'scp', 'station'],
    ascending=False
).reset_index(drop=True)
scp_by_usage[0:10]
Out[45]:
cummulative_flow scp station
0 303459 00-00-01 PAVONIA/NEWPORT
1 133520 00-01-00 EXCHANGE PLACE
2 49188 00-04-00 EXCHANGE PLACE
3 35956 00-00-00 42 ST-GRD CNTRL
4 30232 00-00-01 42 ST-GRD CNTRL
5 22210 00-03-00 42 ST-GRD CNTRL
6 21123 00-03-03 42 ST-GRD CNTRL
7 20502 00-03-04 42 ST-GRD CNTRL
8 20002 00-00-00 34 ST-PENN STA
9 19824 00-00-00 125 ST
In [47]:
df_by_station = august_1_2013_from_big.groupby(['station', 'scp'])

scp_by_usage_big = []
for (i_station, ((station, scp), data)) in enumerate(df_by_station):
#     if i_station > 2:
#         break
    
    scp_by_usage_big.append({
        'scp': scp,
        'station': station,
        'cummulative_flow': data.cummulative_flow.sum()
    })

scp_by_usage_big = pd.DataFrame(scp_by_usage_big)
scp_by_usage_big = scp_by_usage_big.sort(
    ['cummulative_flow', 'scp', 'station'],
    ascending=False
).reset_index(drop=True)
scp_by_usage_big[0:10]
Out[47]:
cummulative_flow scp station
0 303459 00-00-01 PAVONIA/NEWPORT
1 133520 00-01-00 EXCHANGE PLACE
2 49188 00-04-00 EXCHANGE PLACE
3 35956 00-00-00 42 ST-GRD CNTRL
4 30205 00-00-01 42 ST-GRD CNTRL
5 22210 00-03-00 42 ST-GRD CNTRL
6 21123 00-03-03 42 ST-GRD CNTRL
7 20502 00-03-04 42 ST-GRD CNTRL
8 20002 00-00-00 34 ST-PENN STA
9 19824 00-00-00 125 ST
In [48]:
difference_stations = pd.merge(
    scp_by_usage,
    scp_by_usage_big,
    left_on=["scp", "station"],
    right_on=["scp", "station"],
    how="left"
)
In [49]:
difference_stations['difference'] = difference_stations.cummulative_flow_x - difference_stations.cummulative_flow_y
In [50]:
difference_stations = difference_stations.sort('difference', ascending=False)
difference_stations.difference.sum()
Out[50]:
22163.0

Analysis of results between August 1st data alone and combined

The total counts are slightly off (by < 1%), but the top stations are the same. This is probably due to slight difference between those small stations which have "bad" hours and are inconsistent. This could be because the odometer readings were reset and they picked up additional information from the days previous

Sanity Checks on data

In [51]:
%matplotlib inline
mta.df[(mta.df.scp == '00-00-01') & (mta.df.station == '1 AVE')].head().plot(x='datetime', y='entries')
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x10fc11590>
In [52]:
mta.df[(mta.df.scp == '00-00-01') & (mta.df.station == '1 AVE')].plot(x='datetime', y='entries')
Out[52]:
<matplotlib.axes._subplots.AxesSubplot at 0x102b1bcd0>
In [53]:
test = mta.df[['datetime', 'entries', 'exits', 'station', 'date']]#.plot()#(x='datetime', y='entries')
In [54]:
test.head()
Out[54]:
datetime entries exits station date
168654 2013-06-29 00:00:00 NaN NaN 1 AVE 2013-06-29
168655 2013-06-29 04:00:00 188 342 1 AVE 2013-06-29
168656 2013-06-29 04:00:00 0 0 1 AVE 2013-06-29
168657 2013-06-29 08:00:00 104 318 1 AVE 2013-06-29
168658 2013-06-29 12:00:00 564 664 1 AVE 2013-06-29

Convert data

We need monthly data, so split by date and get into dfs for entries and exits

In [55]:
df_by_date = {
    'entries': {},
    'exits': {},
}

for (i_grouping, ((date, station), data)) in enumerate(test.groupby(['date', 'station'])):
#     print(i_grouping, _datetime, data.entries.sum())
    
    if station not in df_by_date['entries']:
        df_by_date['entries'][station] = {}
    if station not in df_by_date['exits']:
        df_by_date['exits'][station] = {}
    
    df_by_date['entries'][station][date] = data.entries.sum()
    df_by_date['exits'][station][date] = data.exits.sum()

df_by_date['entries'] = pd.DataFrame(df_by_date['entries'])
df_by_date['exits'] = pd.DataFrame(df_by_date['exits'])

df_by_date['entries'].head()
Out[55]:
1 AVE 103 ST 103 ST-CORONA 104 ST 110 ST 110 ST-CATHEDRL 110 ST-CPN 111 ST 116 ST 116 ST-COLUMBIA ... WESTCHESTER SQ WHITEHALL ST WHITLOCK AVE WILSON AVE WINTHROP ST WOODHAVEN BLVD WOODLAWN ROAD WORLD TRADE CTR YORK ST ZEREGA AVE
2013-06-29 15524 21893 13473 1402 7076 8769 5589 9177 21882 6651 ... 2959 10078 971 2446 3089 16515 4604 6237 4083 1251
2013-06-30 18325 20028 12231 1075 6846 8755 6205 8054 22183 6655 ... 2600 10708 923 2383 2784 14445 4756 6163 4213 1188
2013-07-01 20750 30153 17775 2320 9360 12233 7499 12359 29353 14116 ... 5697 13929 1492 3433 5452 23480 7606 12123 7623 2483
2013-07-02 21785 32221 18537 2404 11016 13158 7999 12842 30890 15667 ... 6109 15205 1558 3545 7081 25707 7886 13551 8548 2575
2013-07-03 22515 32573 18910 2440 11213 13468 8495 12472 31478 14785 ... 6021 12824 1586 3479 7172 26416 8017 12843 8101 2521

5 rows × 382 columns

In [56]:
df_by_date['entries'].plot( figsize=(20,10), legend=False)
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x124232750>

OK try with subset of stations...?

In [57]:
df_by_date['entries'].columns.tolist()
quarter_cols = len(df_by_date['entries'].columns.tolist()) / 4
In [58]:
df_by_date['entries'][df_by_date['entries'].columns.tolist()[:quarter_cols]].plot( figsize=(20,10), legend=False)
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x1067f96d0>

Interesting things

Pretty cool. we can see week day commuting.

Look we found July 4th!, looks like less people tend to use the subway then. Also notice the black and yellow lines (towards bottom) which follow a different trend than the others - more traffic back on the weekends. I bet these are stations like Cony Island or other vacation like destinations.

Now on to trying to find messed up ones...

In [60]:
df_by_date['entries'][df_by_date['entries'].columns.tolist()[quarter_cols: quarter_cols * 2]].plot( figsize=(20,10), legend=False)
Out[60]:
<matplotlib.axes._subplots.AxesSubplot at 0x110893f90>
In [63]:
screwed_up_entries = df_by_date['entries'].sum()
screwed_up_entries.sort()
In [65]:
screwed_up_entries_normalized = screwed_up_entries / 1e8 # normalize by key in chart
screwed_up_entries_normalized.sort(ascending=False)
screwed_up_entries_normalized[0:11]
Out[65]:
DYRE AVE           5.331245
7 AVE              3.873054
INWOOD-207 ST      1.674412
77 ST              1.348929
FOREST HILLS-71    1.192830
BEACH 98 ST        1.174604
MARCY AVE          0.235312
34 ST-PENN STA     0.050571
MORGAN AVE         0.046240
ATLANTIC AVE       0.044216
42 ST-GRD CNTRL    0.042376
dtype: float64
In [67]:
bad_stations_for_entries = screwed_up_entries_normalized[0:11]  # 11 chosen via trial and error
bad_stations_for_entries
Out[67]:
DYRE AVE           5.331245
7 AVE              3.873054
INWOOD-207 ST      1.674412
77 ST              1.348929
FOREST HILLS-71    1.192830
BEACH 98 ST        1.174604
MARCY AVE          0.235312
34 ST-PENN STA     0.050571
MORGAN AVE         0.046240
ATLANTIC AVE       0.044216
42 ST-GRD CNTRL    0.042376
dtype: float64
In [74]:
entry_stations_to_fix = bad_stations_for_entries.index.values.tolist()
bad_stations_for_entries.index.values.tolist()
Out[74]:
['DYRE AVE',
 '7 AVE',
 'INWOOD-207 ST',
 '77 ST',
 'FOREST HILLS-71',
 'BEACH 98 ST',
 'MARCY AVE',
 '34 ST-PENN STA',
 'MORGAN AVE',
 'ATLANTIC AVE',
 '42 ST-GRD CNTRL']

Found You

Now I need to fix the mta dataset, we know the relative date range and the stations.

In [75]:
df_by_date['entries'][bad_stations_for_entries.index.values.tolist()].plot()
Out[75]:
<matplotlib.axes._subplots.AxesSubplot at 0x10ba00850>

Fix data

Fix Entries

In [77]:
df_by_date['entries'][bad_stations_for_entries.index.values.tolist()]
Out[77]:
DYRE AVE 7 AVE INWOOD-207 ST 77 ST FOREST HILLS-71 BEACH 98 ST MARCY AVE 34 ST-PENN STA MORGAN AVE ATLANTIC AVE 42 ST-GRD CNTRL
2013-06-29 2081 4673 5504 19991 14346 428 5448 75145 4395363 15460 49110
2013-06-30 1597 4351 4824 18292 13372 344 5663 87280 6305 20632 54046
2013-07-01 4502 8186 8038 38385 101782237 314 10425 172546 6571 27309 150034
2013-07-02 4902 9028 8509 41805 27511 308 11238 174234 6614 27498 156135
2013-07-03 4798 8854 8109 40023 27795 362 11352 170676 6827 27755 150063
2013-07-04 532989539 4206 5112 17335 12915 877 6407 75253 4727 14549 45802
2013-07-05 3609 6049 6619 26950 20750 1049 9313 137051 7101 22726 90008
2013-07-06 2158 4351 5161 18870 13726 983 5721 82433 5569 16688 53145
2013-07-07 1560 3897 4242 15037 11160 1061 4927 83699 4916 15319 46793
2013-07-08 4839 8433 8237 37808 16757110 553 10405 174942 6541 26900 146989
2013-07-09 4990 9005 8627 42015 27855 117440971 10957 177083 6879 27660 158829
2013-07-10 5067 8908 9147 41212 28048 376 16681609 176296 7095 26904 160562
2013-07-11 4988 9302 9126 42746 27944 370 11450 177536 7209 27038 159128
2013-07-12 4754 9085 167167057 40861 25637 411 6546419 173263 7355 30468 154783
2013-07-13 2098 4957 15675 23328 15382 305 5864 91969 6695 17207 61656
2013-07-14 1573 4198 13505 133734975 12613 662 5385 83126 5758 16013 49687
2013-07-15 4809 8648 9687 39131 27061 573 10473 176863 6660 27187 154458
2013-07-16 5060 8872 8679 44086 26753 689 10108 176538 6764 31129 162461
2013-07-17 4988 9043 9191 41597 27986 526 11185 176806 6565 27633 160498
2013-07-18 4930 9036 9096 41678 27334 555 11024 174288 7184 27785 162113
2013-07-19 4727 387057271 8571 39491 25626 660 10910 170651 6974 26323 149365
2013-07-20 2158 4605 5727 21686 14745 706 5945 83187 6308 17482 59967
2013-07-21 1633 3578 4779 17360 12405 1117 5494 79671 5453 15789 46777
2013-07-22 4858 8289 8299 38048 26638 577 10679 176095 6692 26806 150295
2013-07-23 5038 8728 8446 41931 27569 421 11258 175141 6725 26991 157126
2013-07-24 5023 8461 8990 42698 28470 474 11540 179804 7006 27303 160655
2013-07-25 4908 8785 9332 42459 28137 403 11345 176866 7070 28528 161379
2013-07-26 4786 8797 8867 40150 26759 499 11190 168340 7135 28760 149377
2013-07-27 2316 5106 6240 22878 15567 1084 6295 91574 9647 20032 60988
2013-07-28 1579 3890 4595 17237 12194 241 5254 82792 8741 19254 48520
2013-07-29 4859 8450 8492 38236 26710 449 10795 171045 6263 3599218 145385
2013-07-30 5104 9181 8889 41442 28162 497 11431 173334 6750 27818 154750
2013-07-31 4965 9263 8792 42114 28131 509 11392 174154 6904 28063 156456
2013-08-01 4845 9005 8349 42385 27782 448 11289 174118 6836 27456 165224
2013-08-02 4846 8859 8707 38662 26573 592 11055 163280 6805 27907 145079

so the bad dates to fix are:

station date
DYRE AVE 2013-07-04
7 AVE 2013-07-19
INWOOD-207 ST 2013-07-12
77 ST 2013-07-14
FOREST HILLS-71 2013-07-08
BEACH 98 ST 2013-07-09
etc etc

(worst offenders)

In [86]:
fixed_entries = df_by_date['entries']
for station in fixed_entries[entry_stations_to_fix]:
    print(station, fixed_entries[station].max(), fixed_entries[station].idxmax(), fixed_entries[station].mean())
    
    fixed_entries.xs(fixed_entries[station].idxmax())[station] = 0
    # set to mean?
    fixed_entries.xs(fixed_entries[station].idxmax())[station] = fixed_entries[station].mean()
fixed_entries
('DYRE AVE', 4988.0, Timestamp('2013-07-17 00:00:00'), 3137.4371558517282)
('7 AVE', 9005.0, Timestamp('2013-07-09 00:00:00'), 5794.4523275301954)
('INWOOD-207 ST', 9096.0, Timestamp('2013-07-18 00:00:00'), 6242.2270440649727)
('77 ST', 41931.0, Timestamp('2013-07-23 00:00:00'), 27043.723631820078)
('FOREST HILLS-71', 27944.0, Timestamp('2013-07-11 00:00:00'), 18083.533011911702)
('BEACH 98 ST', 689.0, Timestamp('2013-07-16 00:00:00'), 396.83750570595583)
('MARCY AVE', 11289.0, Timestamp('2013-08-01 00:00:00'), 7384.3716298209083)
('34 ST-PENN STA', 176095.0, Timestamp('2013-07-22 00:00:00'), 114500.77767330279)
('MORGAN AVE', 7095.0, Timestamp('2013-07-10 00:00:00'), 5473.1579315285298)
('ATLANTIC AVE', 27785.0, Timestamp('2013-07-18 00:00:00'), 19481.98981757601)
('42 ST-GRD CNTRL', 158829.0, Timestamp('2013-07-09 00:00:00'), 92808.419878384011)
Out[86]:
1 AVE 103 ST 103 ST-CORONA 104 ST 110 ST 110 ST-CATHEDRL 110 ST-CPN 111 ST 116 ST 116 ST-COLUMBIA ... WESTCHESTER SQ WHITEHALL ST WHITLOCK AVE WILSON AVE WINTHROP ST WOODHAVEN BLVD WOODLAWN ROAD WORLD TRADE CTR YORK ST ZEREGA AVE
2013-06-29 15524 21893 13473 1402 7076 8769 5589 9177 21882 6651 ... 2959 10078 971 2446 3089 16515 4604 6237 4083 1251
2013-06-30 18325 20028 12231 1075 6846 8755 6205 8054 22183 6655 ... 2600 10708 923 2383 2784 14445 4756 6163 4213 1188
2013-07-01 20750 30153 17775 2320 9360 12233 7499 12359 29353 14116 ... 5697 13929 1492 3433 5452 23480 7606 12123 7623 2483
2013-07-02 21785 32221 18537 2404 11016 13158 7999 12842 30890 15667 ... 6109 15205 1558 3545 7081 25707 7886 13551 8548 2575
2013-07-03 22515 32573 18910 2440 11213 13468 8495 12472 31478 14785 ... 6021 12824 1586 3479 7172 26416 8017 12843 8101 2521
2013-07-04 16106 20503 12407 1316 6795 8477 6097 8546 20997 6380 ... 2713 8092 1062 2529 3167 16794 4839 6133 3693 1323
2013-07-05 17598 26329 17022 1971 9735 10436 7928 11283 27878 10367 ... 4790 14461 1352 3384 5715 21937 7489 9692 5709 2171
2013-07-06 15597 19875 13824 1397 7662 8235 6266 8720 22484 6660 ... 3095 876 1091 2742 3308 16633 5610 8085 4177 1330
2013-07-07 14104 17237 10512 1088 6539 7466 5474 7176 19616 5545 ... 2300 10 982 2479 2667 14185 4650 6519 4163 1097
2013-07-08 19858 31256 17960 2398 10856 12210 8191 12906 31069 13218 ... 6093 13262 1483 3605 7337 24840 8280 12548 8356 2693
2013-07-09 21723 33269 18577 2448 11422 13268 8514 13128 31739 15937 ... 6438 15275 1522 3560 7395 25734 804 13415 8695 2792
2013-07-10 22702 33527 19268 2371 11672 13370 8843 13308 32939 15924 ... 6427 14844 1538 3600 7551 25565 NaN 12958 8756 2800
2013-07-11 23002 33342 19241 2439 11421 13437 8715 13108 32530 15877 ... 6234 14749 1583 3565 7347 25550 NaN 14192 8759 2711
2013-07-12 23917 33207 19332 2341 11408 13422 8730 13081 32992 14703 ... 6056 14874 1554 3180 6133 25833 NaN 13524 9236 2595
2013-07-13 18381 21720 14826 1440 7894 9545 6669 9669 23916 7380 ... 3176 12696 1034 2836 3267 17355 37661 7854 4905 1434
2013-07-14 17124 19193 11753 1158 7353 8054 6232 7858 21134 6129 ... 2626 12058 908 2592 2678 13824 4677 7004 4906 1095
2013-07-15 20462 31600 18223 2379 11054 12259 8661 13087 31560 13827 ... 6004 16064 1551 3754 7528 24452 7541 13389 8488 2708
2013-07-16 21609 33189 18952 2440 11452 13004 8748 12931 33399 16155 ... 6281 15352 1567 3647 8717 25562 8070 13303 8686 2747
2013-07-17 22785 33115 18529 2474 11508 13409 9167 13296 32656 15168 ... 6245 15549 1574 3668 7512 26255 8134 13722 8782 2707
2013-07-18 22823 33157 18310 2479 11423 13041 9029 13014 32584 16047 ... 6318 16210 1576 3660 7301 25693 7882 13632 8684 2718
2013-07-19 23023 32176 18745 2350 11110 12873 8782 12909 31422 13930 ... 5842 14942 1474 3669 7081 25239 7799 13013 8511 2545
2013-07-20 18167 21343 14907 1445 7979 9200 6621 9895 23512 7148 ... 3171 12482 1045 2801 3361 17171 5755 7630 5050 1375
2013-07-21 16460 18174 12202 990 6562 8060 5812 8269 20414 6467 ... 2407 11464 922 2584 2820 14306 4644 6524 5117 1191
2013-07-22 20383 31291 18193 2378 10558 12107 8350 12807 30609 14264 ... 5986 15774 1489 3623 7261 24242 7943 13093 8438 2716
2013-07-23 20900 33075 18905 2488 11047 13096 8407 12933 31456 15270 ... 6393 15486 1581 3658 8261 25354 7960 13337 8602 2709
2013-07-24 23235 33909 19420 2514 11320 13258 8853 13279 32730 15604 ... 6484 16831 1628 3678 7479 26302 8382 13872 8904 2790
2013-07-25 23103 33694 19393 2517 11405 13309 8673 13275 32878 15619 ... 6318 15386 1597 3647 7259 25977 7911 13833 9037 2761
2013-07-26 23219 33007 18867 2380 11246 13180 8572 12985 32131 14122 ... 6281 15974 1590 3671 7200 25842 8186 14340 8750 2579
2013-07-27 20958 22512 14811 1446 8101 9583 6478 9844 24286 7554 ... 3433 1031 1155 3054 3519 17800 6514 9016 5226 1542
2013-07-28 15830 18209 11878 1085 6508 8357 5318 8168 20885 6634 ... 2419 22 933 2552 2693 13714 5809 7691 4760 1116
2013-07-29 20208 31409 18226 2428 10700 11987 8172 12917 30896 14057 ... 6205 14705 1548 3464 7145 24360 8032 12672 8149 2583
2013-07-30 21805 33127 18641 2397 11094 12753 8631 12990 32133 15237 ... 6368 16022 1626 3723 8274 25293 7943 13129 8938 2759
2013-07-31 22651 32796 19130 2345 11481 12721 9021 13194 32410 15033 ... 6365 15286 1587 3732 7414 25738 8022 13403 8924 2697
2013-08-01 22618 33284 18951 2430 11323 12669 8119 12792 32199 15030 ... 6174 14498 1539 3705 6981 25408 7756 12927 8655 2691
2013-08-02 22664 32726 19222 2383 11058 12793 8628 12938 32473 13126 ... 6291 15690 1573 3706 7022 25788 8174 13059 8581 2617

35 rows × 382 columns

In [87]:
fixed_entries.plot( figsize=(20,10), legend=False)
Out[87]:
<matplotlib.axes._subplots.AxesSubplot at 0x10c3f3050>

A few more to fix...

In [91]:
fixed_entries.max().max()
Out[91]:
296483.0
In [92]:
fixed_entries.max().idxmax()
Out[92]:
'PAVONIA/NEWPORT'
In [93]:
fixed_entries['PAVONIA/NEWPORT']
Out[93]:
2013-06-29      9304
2013-06-30      9028
2013-07-01     19057
2013-07-02     19645
2013-07-03     19654
2013-07-04      8316
2013-07-05     16607
2013-07-06      9050
2013-07-07      6660
2013-07-08     19074
2013-07-09     19892
2013-07-10     19979
2013-07-11     19695
2013-07-12     18582
2013-07-13      9671
2013-07-14      7988
2013-07-15     19230
2013-07-16     19834
2013-07-17     19893
2013-07-18     20047
2013-07-19     19773
2013-07-20      9598
2013-07-21      7658
2013-07-22     19148
2013-07-23     19853
2013-07-24     20302
2013-07-25     19904
2013-07-26     19085
2013-07-27      9768
2013-07-28      7224
2013-07-29     18836
2013-07-30     19455
2013-07-31     18903
2013-08-01    296483
2013-08-02     17477
Name: PAVONIA/NEWPORT, dtype: float64

Well this is our august 1st data....

We did the August 1st analysis assuming it didn't break any trend, may have to go back now...

In [94]:
fixed_entries.xs(datetime(2013, 8, 1))['PAVONIA/NEWPORT'] = 0 # should not matter for July analysis, just for printing a graph
In [95]:
fixed_entries.plot( figsize=(20,10), legend=False)
Out[95]:
<matplotlib.axes._subplots.AxesSubplot at 0x10d92f6d0>
In [102]:
july_entries = fixed_entries[datetime(2013, 7, 1): datetime(2013, 7, 31)]
july_entries.index
Out[102]:
DatetimeIndex(['2013-07-01', '2013-07-02', '2013-07-03', '2013-07-04',
               '2013-07-05', '2013-07-06', '2013-07-07', '2013-07-08',
               '2013-07-09', '2013-07-10', '2013-07-11', '2013-07-12',
               '2013-07-13', '2013-07-14', '2013-07-15', '2013-07-16',
               '2013-07-17', '2013-07-18', '2013-07-19', '2013-07-20',
               '2013-07-21', '2013-07-22', '2013-07-23', '2013-07-24',
               '2013-07-25', '2013-07-26', '2013-07-27', '2013-07-28',
               '2013-07-29', '2013-07-30', '2013-07-31'],
              dtype='datetime64[ns]', freq=None, tz=None)
In [103]:
july_entries.head()
Out[103]:
1 AVE 103 ST 103 ST-CORONA 104 ST 110 ST 110 ST-CATHEDRL 110 ST-CPN 111 ST 116 ST 116 ST-COLUMBIA ... WESTCHESTER SQ WHITEHALL ST WHITLOCK AVE WILSON AVE WINTHROP ST WOODHAVEN BLVD WOODLAWN ROAD WORLD TRADE CTR YORK ST ZEREGA AVE
2013-07-01 20750 30153 17775 2320 9360 12233 7499 12359 29353 14116 ... 5697 13929 1492 3433 5452 23480 7606 12123 7623 2483
2013-07-02 21785 32221 18537 2404 11016 13158 7999 12842 30890 15667 ... 6109 15205 1558 3545 7081 25707 7886 13551 8548 2575
2013-07-03 22515 32573 18910 2440 11213 13468 8495 12472 31478 14785 ... 6021 12824 1586 3479 7172 26416 8017 12843 8101 2521
2013-07-04 16106 20503 12407 1316 6795 8477 6097 8546 20997 6380 ... 2713 8092 1062 2529 3167 16794 4839 6133 3693 1323
2013-07-05 17598 26329 17022 1971 9735 10436 7928 11283 27878 10367 ... 4790 14461 1352 3384 5715 21937 7489 9692 5709 2171

5 rows × 382 columns

Fix Exits

assuming similar errors for exits.

In [104]:
df_by_date['exits'].plot( figsize=(20,10), legend=False)
Out[104]:
<matplotlib.axes._subplots.AxesSubplot at 0x105fa6a50>
In [128]:
july_exits = df_by_date['exits'][datetime(2013, 7, 1): datetime(2013, 7, 31)]

july_exits.head()
Out[128]:
1 AVE 103 ST 103 ST-CORONA 104 ST 110 ST 110 ST-CATHEDRL 110 ST-CPN 111 ST 116 ST 116 ST-COLUMBIA ... WESTCHESTER SQ WHITEHALL ST WHITLOCK AVE WILSON AVE WINTHROP ST WOODHAVEN BLVD WOODLAWN ROAD WORLD TRADE CTR YORK ST ZEREGA AVE
2013-07-01 21632 20991 12876 1672 8442 6193 4614 8974 17145 6863 ... 5301 11724 1412 2942 1435 15620 1741 6424 6734 1697
2013-07-02 22092 21765 13549 1844 9222 6822 4591 9787 18128 7416 ... 5625 12739 1515 3055 1571 16747 1884 6951 7293 1723
2013-07-03 23638 21390 14120 1875 9462 6583 5637 9985 18234 6954 ... 5683 11143 1474 3240 1620 18059 1814 6836 7054 1726
2013-07-04 14587 12695 11350 1186 5972 3806 4330 7871 11698 2581 ... 3051 7744 941 2581 761 12081 1575 4226 3869 1015
2013-07-05 22166 18479 12134 1548 7858 5479 5502 8597 17121 5187 ... 4753 10788 1304 3034 1213 16152 1597 6415 5090 1511

5 rows × 382 columns

In [129]:
max_exits = july_exits.max()
max_exits.sort(ascending=False)
max_exits[0:15]
Out[129]:
FOREST HILLS-71    953165446
DYRE AVE           534128238
MARCY AVE          469749073
7 AVE              270122256
INWOOD-207 ST      133746335
ATLANTIC AVE        20262343
77 ST               16734397
EXCHANGE PLACE        341138
34 ST-PENN STA        165068
BEACH 90 ST           164206
42 ST-GRD CNTRL       152746
34 ST-HERALD SQ       120950
42 ST-TIMES SQ        108401
14 ST-UNION SQ        101576
86 ST                  92117
dtype: float64
In [130]:
max_exits_normalized = max_exits / 1e9
max_exits_normalized.sort(ascending=False)
max_exits_normalized[0:15]
Out[130]:
FOREST HILLS-71    0.953165
DYRE AVE           0.534128
MARCY AVE          0.469749
7 AVE              0.270122
INWOOD-207 ST      0.133746
ATLANTIC AVE       0.020262
77 ST              0.016734
EXCHANGE PLACE     0.000341
34 ST-PENN STA     0.000165
BEACH 90 ST        0.000164
42 ST-GRD CNTRL    0.000153
34 ST-HERALD SQ    0.000121
42 ST-TIMES SQ     0.000108
14 ST-UNION SQ     0.000102
86 ST              0.000092
dtype: float64
In [131]:
max_exits[0:8]
Out[131]:
FOREST HILLS-71    953165446
DYRE AVE           534128238
MARCY AVE          469749073
7 AVE              270122256
INWOOD-207 ST      133746335
ATLANTIC AVE        20262343
77 ST               16734397
EXCHANGE PLACE        341138
dtype: float64
In [132]:
exit_stations_to_fix = max_exits[0:8].index.values.tolist()
exit_stations_to_fix
Out[132]:
['FOREST HILLS-71',
 'DYRE AVE',
 'MARCY AVE',
 '7 AVE',
 'INWOOD-207 ST',
 'ATLANTIC AVE',
 '77 ST',
 'EXCHANGE PLACE']

We need to fix 0:8

In [133]:
for station in july_exits[exit_stations_to_fix]:
    print(station, july_exits[station].max(), july_exits[station].idxmax(), july_exits[station].mean())
    
    july_exits.xs(july_exits[station].idxmax())[station] = 0
    # set to mean?
    july_exits.xs(july_exits[station].idxmax())[station] = july_exits[station].mean()
july_exits
('FOREST HILLS-71', 953165446.0, Timestamp('2013-07-01 00:00:00'), 31295824.036671363)
('DYRE AVE', 534128238.0, Timestamp('2013-07-04 00:00:00'), 17233653.730159745)
('MARCY AVE', 469749073.0, Timestamp('2013-07-10 00:00:00'), 15278417.256069878)
('7 AVE', 270122256.0, Timestamp('2013-07-19 00:00:00'), 8720399.1640535258)
('INWOOD-207 ST', 133746335.0, Timestamp('2013-07-12 00:00:00'), 4318940.5615556426)
('ATLANTIC AVE', 20262343.0, Timestamp('2013-07-29 00:00:00'), 672849.87352653453)
('77 ST', 16734397.0, Timestamp('2013-07-14 00:00:00'), 569126.11983610527)
('EXCHANGE PLACE', 341138.0, Timestamp('2013-07-30 00:00:00'), 23618.548387096773)
Out[133]:
1 AVE 103 ST 103 ST-CORONA 104 ST 110 ST 110 ST-CATHEDRL 110 ST-CPN 111 ST 116 ST 116 ST-COLUMBIA ... WESTCHESTER SQ WHITEHALL ST WHITLOCK AVE WILSON AVE WINTHROP ST WOODHAVEN BLVD WOODLAWN ROAD WORLD TRADE CTR YORK ST ZEREGA AVE
2013-07-01 21632 20991 12876 1672 8442 6193 4614 8974 17145 6863 ... 5301 11724 1412 2942 1435 15620 1741 6424 6734 1697
2013-07-02 22092 21765 13549 1844 9222 6822 4591 9787 18128 7416 ... 5625 12739 1515 3055 1571 16747 1884 6951 7293 1723
2013-07-03 23638 21390 14120 1875 9462 6583 5637 9985 18234 6954 ... 5683 11143 1474 3240 1620 18059 1814 6836 7054 1726
2013-07-04 14587 12695 11350 1186 5972 3806 4330 7871 11698 2581 ... 3051 7744 941 2581 761 12081 1575 4226 3869 1015
2013-07-05 22166 18479 12134 1548 7858 5479 5502 8597 17121 5187 ... 4753 10788 1304 3034 1213 16152 1597 6415 5090 1511
2013-07-06 16604 13614 6364 1286 6930 3985 4670 7297 13063 2996 ... 3382 721 1148 2738 692 12969 1636 5162 4845 1112
2013-07-07 15789 12271 3711 1102 6130 3916 4394 6333 12098 2963 ... 2758 37 1124 2526 692 10540 1297 3785 4885 1065
2013-07-08 21132 20560 11137 1694 9103 6366 5480 8938 17457 7028 ... 5307 11495 1501 3097 1607 15987 1767 7017 7590 1777
2013-07-09 22617 23012 14630 1942 9691 6585 5445 10034 18855 7641 ... 5989 12294 1436 3366 1728 17042 370 6697 7592 1905
2013-07-10 23368 22215 14410 1921 9508 6660 5626 10013 19094 7591 ... 5897 12652 1535 3381 1711 16212 NaN 6807 7639 1870
2013-07-11 23644 23278 14380 1951 9850 6645 5059 10099 18726 7319 ... 5976 12732 1626 3183 1805 32434 NaN 7105 8267 1723
2013-07-12 25527 23293 14264 1821 9542 6532 4834 10297 19158 6857 ... 5754 12438 1546 3380 1501 17326 NaN 6601 7502 1787
2013-07-13 20268 15174 11854 1329 7524 3945 4365 8147 14840 2916 ... 3672 10860 1002 2686 852 12388 9534 4946 5052 1166
2013-07-14 18286 14321 10522 1034 7103 3618 5366 7333 13585 2712 ... 3006 10249 915 2544 634 10397 1443 3805 4848 1015
2013-07-15 21560 23089 13494 1678 9466 5929 5571 8960 18381 6916 ... 5133 13265 1450 3311 2013 15949 1524 7074 7511 1802
2013-07-16 22882 22667 14342 1906 9651 6623 5909 9780 19139 8125 ... 5647 12956 1525 3269 1762 16260 2011 6837 7783 1787
2013-07-17 22955 24124 14872 1983 9707 6541 5872 10063 19022 7082 ... 5799 13579 1599 3492 1699 16765 1911 6941 7639 1798
2013-07-18 23289 23904 14206 1899 9874 6769 6042 10146 18771 7892 ... 5503 13484 1616 2926 1682 16893 1974 7027 7871 1859
2013-07-19 24340 23100 14054 1759 9551 6181 5841 9947 18345 6335 ... 5507 12900 1492 3201 1687 16737 2019 6673 7703 1697
2013-07-20 19487 15721 12927 1255 7191 4376 4610 8370 13897 3198 ... 3612 8740 1141 2756 927 12765 1687 5012 5353 1148
2013-07-21 17973 13493 10355 1086 6188 4192 4358 7562 12803 3163 ... 2873 7774 978 2452 784 9949 1508 4562 5745 956
2013-07-22 22366 21841 12108 1753 8968 6221 5273 9212 17760 7102 ... 5528 12918 1478 3467 1866 15262 1765 7099 7552 1749
2013-07-23 22460 23502 13139 1984 9099 6443 5999 10006 18626 7405 ... 5878 13281 1482 3137 1772 16189 1840 6806 7480 1763
2013-07-24 22736 24463 13570 2012 9754 6633 6110 10113 18827 6842 ... 5988 15061 1569 3357 1634 16977 1810 6764 7800 1885
2013-07-25 23841 23411 14348 1970 9633 6637 5903 10316 19293 7213 ... 5990 13971 1605 3365 1780 16862 1830 6682 8086 1800
2013-07-26 23807 22838 14192 1873 9769 6273 7289 9609 19152 6554 ... 5753 13653 1570 3284 1649 17193 1873 8091 7401 1649
2013-07-27 20349 16352 12431 1354 7153 4559 6746 8463 14113 3399 ... 3798 1053 1264 2905 898 13254 2432 6019 6196 1196
2013-07-28 18872 14139 10417 1094 6025 4355 6041 7103 13147 3105 ... 2876 102 1014 2566 779 10426 1410 4638 5170 895
2013-07-29 21140 21766 12951 1817 8972 6113 7663 9048 17843 6952 ... 5576 12812 1423 3260 2023 15322 2093 6583 7439 1690
2013-07-30 22453 23169 14175 1819 9501 6472 8473 9859 19205 7250 ... 5949 13468 1508 3482 1842 16146 1838 6707 7617 1714
2013-07-31 22852 22375 14861 1904 9663 6060 8878 9877 18913 7195 ... 5859 13201 1503 3495 1798 16547 1854 6596 7935 1815

31 rows × 382 columns

In [134]:
july_exits.plot( figsize=(20,10), legend=False)
Out[134]:
<matplotlib.axes._subplots.AxesSubplot at 0x112d32d90>

one more to fix

In [136]:
july_exits.max().max()
Out[136]:
548551.58505845966
In [138]:
july_exits.max().idxmax()
Out[138]:
'FOREST HILLS-71'
In [139]:
july_exits['FOREST HILLS-71']
Out[139]:
2013-07-01         0.000000
2013-07-02     11075.000000
2013-07-03     16983.136812
2013-07-04      5023.000000
2013-07-05      8754.000000
2013-07-06      5695.000000
2013-07-07      4749.000000
2013-07-08    548551.585058
2013-07-09     11312.000000
2013-07-10     11445.000000
2013-07-11     11592.000000
2013-07-12     10516.000000
2013-07-13      6292.000000
2013-07-14      5491.000000
2013-07-15     10840.000000
2013-07-16     10981.000000
2013-07-17     11101.000000
2013-07-18     11034.000000
2013-07-19     10548.000000
2013-07-20      6270.000000
2013-07-21      5155.000000
2013-07-22     11176.000000
2013-07-23     10763.000000
2013-07-24     11406.000000
2013-07-25     11372.000000
2013-07-26     10316.000000
2013-07-27      6714.000000
2013-07-28      5323.000000
2013-07-29     10636.000000
2013-07-30     11456.000000
2013-07-31     11737.000000
Name: FOREST HILLS-71, dtype: float64
In [140]:
july_exits.xs(datetime(2013, 7, 8))['FOREST HILLS-71'] = 0
In [141]:
july_exits.plot( figsize=(20,10), legend=False)
Out[141]:
<matplotlib.axes._subplots.AxesSubplot at 0x1140e8e50>

Prep July data, with 'Busy-ness'

In [142]:
july_busyness = july_exits + july_entries
In [146]:
july_exits.head()
Out[146]:
1 AVE 103 ST 103 ST-CORONA 104 ST 110 ST 110 ST-CATHEDRL 110 ST-CPN 111 ST 116 ST 116 ST-COLUMBIA ... WESTCHESTER SQ WHITEHALL ST WHITLOCK AVE WILSON AVE WINTHROP ST WOODHAVEN BLVD WOODLAWN ROAD WORLD TRADE CTR YORK ST ZEREGA AVE
2013-07-01 21632 20991 12876 1672 8442 6193 4614 8974 17145 6863 ... 5301 11724 1412 2942 1435 15620 1741 6424 6734 1697
2013-07-02 22092 21765 13549 1844 9222 6822 4591 9787 18128 7416 ... 5625 12739 1515 3055 1571 16747 1884 6951 7293 1723
2013-07-03 23638 21390 14120 1875 9462 6583 5637 9985 18234 6954 ... 5683 11143 1474 3240 1620 18059 1814 6836 7054 1726
2013-07-04 14587 12695 11350 1186 5972 3806 4330 7871 11698 2581 ... 3051 7744 941 2581 761 12081 1575 4226 3869 1015
2013-07-05 22166 18479 12134 1548 7858 5479 5502 8597 17121 5187 ... 4753 10788 1304 3034 1213 16152 1597 6415 5090 1511

5 rows × 382 columns

In [147]:
july_entries.head()
Out[147]:
1 AVE 103 ST 103 ST-CORONA 104 ST 110 ST 110 ST-CATHEDRL 110 ST-CPN 111 ST 116 ST 116 ST-COLUMBIA ... WESTCHESTER SQ WHITEHALL ST WHITLOCK AVE WILSON AVE WINTHROP ST WOODHAVEN BLVD WOODLAWN ROAD WORLD TRADE CTR YORK ST ZEREGA AVE
2013-07-01 20750 30153 17775 2320 9360 12233 7499 12359 29353 14116 ... 5697 13929 1492 3433 5452 23480 7606 12123 7623 2483
2013-07-02 21785 32221 18537 2404 11016 13158 7999 12842 30890 15667 ... 6109 15205 1558 3545 7081 25707 7886 13551 8548 2575
2013-07-03 22515 32573 18910 2440 11213 13468 8495 12472 31478 14785 ... 6021 12824 1586 3479 7172 26416 8017 12843 8101 2521
2013-07-04 16106 20503 12407 1316 6795 8477 6097 8546 20997 6380 ... 2713 8092 1062 2529 3167 16794 4839 6133 3693 1323
2013-07-05 17598 26329 17022 1971 9735 10436 7928 11283 27878 10367 ... 4790 14461 1352 3384 5715 21937 7489 9692 5709 2171

5 rows × 382 columns

In [148]:
july_busyness.head()
Out[148]:
1 AVE 103 ST 103 ST-CORONA 104 ST 110 ST 110 ST-CATHEDRL 110 ST-CPN 111 ST 116 ST 116 ST-COLUMBIA ... WESTCHESTER SQ WHITEHALL ST WHITLOCK AVE WILSON AVE WINTHROP ST WOODHAVEN BLVD WOODLAWN ROAD WORLD TRADE CTR YORK ST ZEREGA AVE
2013-07-01 42382 51144 30651 3992 17802 18426 12113 21333 46498 20979 ... 10998 25653 2904 6375 6887 39100 9347 18547 14357 4180
2013-07-02 43877 53986 32086 4248 20238 19980 12590 22629 49018 23083 ... 11734 27944 3073 6600 8652 42454 9770 20502 15841 4298
2013-07-03 46153 53963 33030 4315 20675 20051 14132 22457 49712 21739 ... 11704 23967 3060 6719 8792 44475 9831 19679 15155 4247
2013-07-04 30693 33198 23757 2502 12767 12283 10427 16417 32695 8961 ... 5764 15836 2003 5110 3928 28875 6414 10359 7562 2338
2013-07-05 39764 44808 29156 3519 17593 15915 13430 19880 44999 15554 ... 9543 25249 2656 6418 6928 38089 9086 16107 10799 3682

5 rows × 382 columns

In [149]:
july_busyness.plot(figsize=(20,10), legend=False)
Out[149]:
<matplotlib.axes._subplots.AxesSubplot at 0x1125ca990>

Analysis

3. What were the busiest and least-busy stations in the system over all of July 2013?

In [150]:
max_stations = july_busyness.mean()
max_stations.sort(ascending=False)
max_stations.head()
Out[150]:
34 ST-PENN STA     234276.829372
42 ST-GRD CNTRL    205542.165942
34 ST-HERALD SQ    198777.354839
42 ST-TIMES SQ     179415.419355
14 ST-UNION SQ     178422.741935
dtype: float64
In [151]:
max_stations.tail()
Out[151]:
BEACH 44 ST       967.064516
TOMPKINSVILLE     568.903226
ORCHARD BEACH     549.322581
BROAD CHANNEL     256.935484
AQUEDUCT TRACK      8.645161
dtype: float64

Analysis of 3

This is consistent with our previous analysis. The stations which have a lot of transfering stations and are located in central areas are the busiest

4. Which station had the highest average number of entries between midnight & 4am on Fridays in July 2013?

We will have to reconstruct this using the data above. Good thing is that we know the bad days of data

Previous fixes

Entries

Station day
PAVONIA/NEWPORT 2013-08-01
DYRE AVE 2013-07-17
7 AVE 2013-07-09
INWOOD-207 ST 2013-07-18
77 ST 2013-07-23
FOREST HILLS-71 2013-07-11
BEACH 98 ST 2013-07-16
MARCY AVE 2013-08-01
34 ST-PENN STA 2013-07-22
MORGAN AVE 2013-07-10
ATLANTIC AVE 2013-07-18
42 ST-GRD CNTRL 2013-07-09

Exits

Station day
FOREST HILLS-71 2013-07-08
FOREST HILLS-71 2013-07-01
DYRE AVE 2013-07-04
MARCY AVE 2013-07-10
7 AVE 2013-07-19
INWOOD-207 ST 2013-07-12
ATLANTIC AVE 2013-07-29
77 ST 2013-07-14
EXCHANGE PLACE 2013-07-30
In [159]:
from datetime import timedelta
In [160]:
date_start = datetime(2013, 7, 1)
date_end = datetime(2013, 7, 31)

fridays_in_july = []

current_date = date_start

while current_date <= date_end:
    
    if current_date.strftime("%A") == 'Friday':
        print('Friday:', current_date)
        fridays_in_july.append(current_date)
#     print(current_date, current_date.day, current_date.strftime("%A"))
    current_date += timedelta(days=1)
('Friday:', datetime.datetime(2013, 7, 5, 0, 0))
('Friday:', datetime.datetime(2013, 7, 12, 0, 0))
('Friday:', datetime.datetime(2013, 7, 19, 0, 0))
('Friday:', datetime.datetime(2013, 7, 26, 0, 0))
In [161]:
fridays_in_july
Out[161]:
[datetime.datetime(2013, 7, 5, 0, 0),
 datetime.datetime(2013, 7, 12, 0, 0),
 datetime.datetime(2013, 7, 19, 0, 0),
 datetime.datetime(2013, 7, 26, 0, 0)]

Lets see if any of those dates oversect with the ones we knew may have errors...

date fix
2013-07-12 INWOOD-207 ST
2013-07-19 7th Ave

Ok that's manageable. Let's see if it is even possible for these to be among the highest

In [177]:
inwood_data = mta.df[(mta.df.station == 'INWOOD-207 ST') & (mta.df.date == datetime(2013, 7,12))]

inwood_data[
    (inwood_data.datetime >= datetime(2013, 7, 12, 0, 0, 0)) & (inwood_data.datetime < datetime(2013, 7, 12, 4, 0, 0))
][['datetime', 'entries', 'exits']]
Out[177]:
datetime entries exits
231391 2013-07-12 01:00:00 90 219
231433 2013-07-12 01:00:00 75 144
231475 2013-07-12 01:00:00 121 225
231517 2013-07-12 01:00:00 1 0
231559 2013-07-12 01:00:00 0 0
229290 2013-07-12 01:00:00 0 0
229332 2013-07-12 01:00:00 0 0
229374 2013-07-12 01:00:00 68 169
229416 2013-07-12 01:00:00 12 45
229458 2013-07-12 01:00:00 35 94
229726 2013-07-12 00:05:25 5 0
229727 2013-07-12 00:22:57 2 0
229728 2013-07-12 00:51:09 3 0
229729 2013-07-12 00:52:52 2 0
229730 2013-07-12 01:00:00 1 0

OK no need to fix any data there

In [178]:
seventh_ave = mta.df[(mta.df.station == '7th Ave') & (mta.df.date == datetime(2013, 7,12))]

seventh_ave[
    (seventh_ave.datetime >= datetime(2013, 7, 12, 0, 0, 0)) & (seventh_ave.datetime < datetime(2013, 7, 12, 4, 0, 0))
][['datetime', 'entries', 'exits']]
Out[178]:
datetime entries exits

and no data is relevant there

In [299]:
fridays_in_july
Out[299]:
[datetime.datetime(2013, 7, 5, 0, 0),
 datetime.datetime(2013, 7, 12, 0, 0),
 datetime.datetime(2013, 7, 19, 0, 0),
 datetime.datetime(2013, 7, 26, 0, 0)]
In [301]:
count = 0
for date in fridays_in_july:
    count += len(mta.df[mta.df.date == date])
data_for_days = mta.df[mta.df.date.isin(fridays_in_july)]
data_for_days.count().max() == count
Out[301]:
True
In [303]:
data_for_days.date.unique()
Out[303]:
array(['2013-07-04T20:00:00.000000000-0400',
       '2013-07-11T20:00:00.000000000-0400',
       '2013-07-18T20:00:00.000000000-0400',
       '2013-07-25T20:00:00.000000000-0400'], dtype='datetime64[ns]')
In [309]:
data_for_days['hour'] = data_for_days.datetime.apply(lambda x: x.hour)
/usr/local/lib/python2.7/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
In [311]:
data_for_days[['hour', 'datetime']].head()
Out[311]:
hour datetime
168703 0 2013-07-05 00:00:00
168705 4 2013-07-05 04:00:00
168704 4 2013-07-05 04:00:00
168706 8 2013-07-05 08:00:00
168707 12 2013-07-05 12:00:00
In [312]:
fridays_morning_data = data_for_days[
    (data_for_days.hour >= 0) & (data_for_days.hour <= 4)
]
len(fridays_morning_data)
Out[312]:
32982
In [322]:
friday_morning_busyness = {}

for (station, hour), data in fridays_morning_data.groupby(['station', 'hour']):
    
    if station not in friday_morning_busyness:
        friday_morning_busyness[station] = {}
        
    friday_morning_busyness[station][hour] = data.entries.sum() #+ data.exits.sum()
friday_morning_busyness = pd.DataFrame(friday_morning_busyness)
friday_morning_busyness
Out[322]:
1 AVE 103 ST 103 ST-CORONA 104 ST 110 ST 110 ST-CATHEDRL 110 ST-CPN 111 ST 116 ST 116 ST-COLUMBIA ... WESTCHESTER SQ WHITEHALL ST WHITLOCK AVE WILSON AVE WINTHROP ST WOODHAVEN BLVD WOODLAWN ROAD WORLD TRADE CTR YORK ST ZEREGA AVE
0 16116 NaN 3684 479 4002 NaN 3106 3126 NaN NaN ... 1847 NaN 520 1336 NaN 1709 2762 5312 NaN NaN
1 NaN 9173 NaN NaN NaN 4608 NaN NaN 9899 3712 ... NaN 4424 NaN NaN 1160 4214 NaN NaN 4228 478
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 4329 NaN 1733 161 769 NaN 1492 667 NaN NaN ... 362 NaN 145 264 NaN 332 645 711 NaN NaN

5 rows × 382 columns

Note, remember this data comes in once every ~4 hours, unless there is a maintenance event, so makes sense only 1-2 values per col

In [323]:
friday_morning_busyness.plot(legend=False, figsize=(20,10))
Out[323]:
<matplotlib.axes._subplots.AxesSubplot at 0x14f914910>
In [329]:
most_busy = friday_morning_busyness.sum() / len(fridays_in_july)
most_busy.sort(ascending=False)
most_busy.head()
Out[329]:
42 ST-TIMES SQ     27236.75
34 ST-PENN STA     21425.00
34 ST-HERALD SQ    20684.50
42 ST-PA BUS TE    16124.25
14 ST-UNION SQ     15358.50
dtype: float64

Using the "highest avg", 42 ST-TIMES SQ had the highest average number of entries between midnight & 4am on Fridays in July 2013

In [334]:
friday_morning_busyness[most_busy[0:10].index].plot(legend=True, figsize=(20,10))
Out[334]:
<matplotlib.axes._subplots.AxesSubplot at 0x1385e6850>
In [332]:
friday_morning_busyness[most_busy.tail().index].plot(legend=True, figsize=(20,10))
Out[332]:
<matplotlib.axes._subplots.AxesSubplot at 0x14ecd5590>

Analysis of 4

This analysis follows all of the others, as stations that are most entered are central hubs which have many train connections. I will also add these are geographically correlated with many of the busy / happening areas in the city, which makes sense because the MTA should have built stations where people needed to go, and people would move to these areas. Its recurssive!

In [ ]: